Methodology and Approach

  1. Importing the dataset and understanding it - the mean, median, and standard deviation of all the columns and their implications
  2. Understanding the response variable - popularity of the charging pool (n_RFID): balanced or imbalanced, rate of popularity, etc.
  3. Data preprocessing and consistency checks - null values, object or string columns and their treatment, etc.
  4. Univariate and bivariate analysis, and understanding which features are most correlated with the popularity of charging pools
  5. Data preparation for modelling - stratified train/test split, scaling the data, outlier treatment
  6. Data modelling using the following classification algorithms
     6.1 Tuning Logistic Regression hyperparameters using the cross-validated roc_auc score and identifying the best penalty - l1, l2, or elasticnet
     6.2 Tuning Random Forest hyperparameters using the cross-validated roc_auc score
     6.3 Tuning Gradient Boosting classification tree hyperparameters using the cross-validated roc_auc score
  7. Visualizing roc_auc and precision-recall curves and interpreting the results for these three algorithms
     7.1 Identifying the optimum threshold for each algorithm using the maximum F1 score
     7.2 Computing metrics such as accuracy, precision, and recall at these optimum thresholds and interpreting the results
  8. Feature importance using the three tuned models; an l1-penalized logistic regression yields sparse coefficients and can therefore be used for feature selection
     8.1 Identifying features that can impact the popularity of charging pools

Importing the dataset and understanding the structure of the dataset

Understanding the response variable n_RFID - popularity of a charging pool (1/0)

The response variable is n_RFID (popularity of charging pool)

So there are 948 unpopular stations and 323 popular stations - roughly a 3:1 imbalance.
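These counts come from tallying the response column. A minimal sketch of the check is below; since the real file is not shown here, a synthetic frame with the same counts stands in for the loaded dataset:

```python
import pandas as pd

# Hypothetical stand-in for the charging-pool dataset; only the
# response column n_RFID is reproduced, with the counts reported above.
df = pd.DataFrame({"n_RFID": [0] * 948 + [1] * 323})

counts = df["n_RFID"].value_counts()
positive_rate = counts[1] / counts.sum()

print(counts)                              # 0 -> 948 unpopular, 1 -> 323 popular
print(f"positive rate: {positive_rate:.2%}")
```

The positive rate of about 25% is what makes this a slightly imbalanced classification problem.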

Data Preprocessing

All the columns are of numerical data types and do not require any one-hot encoding or dummy variable creation

There are no null values in the provided data
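Both consistency checks above can be sketched as follows; the column names here are illustrative, not taken from the real file:

```python
import pandas as pd
import numpy as np

# Hypothetical frame standing in for the charging-pool data;
# the feature names are made up for illustration.
df = pd.DataFrame({
    "dist_city_center": [1.2, 3.4, 0.8],
    "n_chargepoints": [2, 4, 1],
    "n_RFID": [1, 0, 0],
})

# Dtype check: if every column is numeric, no encoding is needed.
all_numeric = df.dtypes.apply(lambda t: np.issubdtype(t, np.number)).all()

# Null check: total count of missing values across all columns.
total_nulls = df.isnull().sum().sum()

print(all_numeric)  # True -> no one-hot encoding or dummies required
print(total_nulls)  # 0 -> no missing values to treat
```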

EDA

Data Preparation for the Model - Train test split, Outlier Treatment
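The stratified split and scaling described above can be sketched as below; `make_classification` generates a synthetic stand-in for the preprocessed dataframe, with a class balance similar to the real data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: ~75% negatives, mirroring the 948/323 split.
X, y = make_classification(n_samples=1271, weights=[0.75], random_state=42)

# Stratified split keeps the popular/unpopular ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training set only, to avoid leakage into the test set.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(y_train.mean(), y_test.mean())  # class ratios match closely
```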

Data Modelling

Logistic Regression
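A minimal sketch of the penalty search described in step 6.1, assuming a grid search over C and the three penalties with roc_auc scoring (the exact grid used in the notebook is not shown); synthetic data stands in for the scaled training set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data.
X, y = make_classification(n_samples=400, weights=[0.75], random_state=0)

# The saga solver supports l1, l2 and elasticnet alike;
# elasticnet additionally needs an l1_ratio.
param_grid = [
    {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]},
    {"penalty": ["elasticnet"], "C": [0.01, 0.1, 1, 10], "l1_ratio": [0.3, 0.5, 0.7]},
]
search = GridSearchCV(
    LogisticRegression(solver="saga", max_iter=5000),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(f"best CV roc_auc: {search.best_score_:.3f}")
```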

Random Forest Model
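Step 6.2 can be sketched the same way; the grid below is illustrative, and the notebook's actual parameter ranges may differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=400, weights=[0.75], random_state=0)

# Illustrative grid over the usual random-forest knobs.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```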

GBM Classifier
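And a comparable sketch for step 6.3, again with an illustrative grid rather than the notebook's exact one:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=400, weights=[0.75], random_state=0)

# Illustrative grid: shallow trees with a moderate learning rate.
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```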

Final results of the models, visualizing roc_auc curves, and selecting thresholds

Visualizing the roc_auc curves of these fine-tuned models

From the above curves:

  1. Logistic regression and gradient boosting perform at a similar level, whereas random forest slightly underperforms
  2. All three precision curves follow a similar pattern, and both precision and recall are high in the threshold region of 0.3 to 0.5
  3. Precision increases as the threshold increases, whereas recall decreases with the threshold
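The curves behind these observations can be computed as below for one model; a synthetic dataset and a plain logistic regression stand in for the tuned models, and in the notebook the resulting arrays are passed to matplotlib:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the charging-pool data.
X, y = make_classification(n_samples=600, weights=[0.75], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

fpr, tpr, roc_thr = roc_curve(y_te, scores)
prec, rec, pr_thr = precision_recall_curve(y_te, scores)

print(f"roc_auc: {roc_auc_score(y_te, scores):.3f}")
# In the notebook these arrays are plotted, e.g. plt.plot(fpr, tpr)
# and plt.plot(rec, prec), with one line per tuned model.
```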

Selecting the optimum threshold levels for each model

For selecting the optimum thresholds, since there is a slight imbalance in the dataset, we cannot rely on accuracy, precision, or recall alone; we need a metric that remains meaningful for a slightly imbalanced dataset like this one.

Some of the metrics that can be used are:

In this assignment we use the F1 score for selecting the optimum threshold. Let's visualize the F1 scores of the three tuned models (logistic regression, random forest, gradient boosting) across these thresholds.
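The threshold sweep can be sketched as follows for one model; a synthetic dataset stands in for the tuned models' predicted probabilities:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Synthetic stand-in for one tuned model's scores.
X, y = make_classification(n_samples=600, weights=[0.75], random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)
scores = model.predict_proba(X)[:, 1]

prec, rec, thr = precision_recall_curve(y, scores)
# F1 = 2PR/(P+R) at every candidate threshold; the final
# precision/recall pair has no threshold, so drop it before the argmax.
f1 = 2 * prec[:-1] * rec[:-1] / np.clip(prec[:-1] + rec[:-1], 1e-12, None)
best = np.argmax(f1)

print(f"optimum threshold: {thr[best]:.3f}, F1: {f1[best]:.3f}")
```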

From the above table, the maximum F1 score and the corresponding threshold for each model are:

Let's look at precision and recall at these optimum thresholds on the test dataset

Precision and recall both look good for each of the tuned models
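Evaluating at a chosen threshold amounts to binarizing the predicted probabilities before scoring; the sketch below uses a synthetic dataset and an illustrative threshold of 0.4 rather than the actual tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic stand-in for one tuned model and its test set.
X, y = make_classification(n_samples=600, weights=[0.75], random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)
scores = model.predict_proba(X)[:, 1]

threshold = 0.4  # illustrative; each model uses its own tuned optimum
pred = (scores >= threshold).astype(int)

print(f"accuracy:  {accuracy_score(y, pred):.3f}")
print(f"precision: {precision_score(y, pred):.3f}")
print(f"recall:    {recall_score(y, pred):.3f}")
```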

Features influencing the popularity of the charging pool stations

From the above table, logistic regression with an l1 penalty can yield sparse coefficients and can therefore be used for feature selection. Let's identify the important features using the logistic regression model.
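The sparsity argument above can be demonstrated directly: with a sufficiently strong l1 penalty, uninformative coefficients are driven to exactly zero, and the surviving features form the selected subset. The feature names below are made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in; only a few of the 8 features are informative.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=2
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Scale first so the l1 penalty treats all coefficients comparably,
# then fit with a fairly strong penalty (small C).
X_s = StandardScaler().fit_transform(X)
model = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
model.fit(X_s, y)

coefs = model.coef_.ravel()
selected = [n for n, c in zip(feature_names, coefs) if abs(c) > 1e-6]
print(f"{len(selected)} of {len(feature_names)} features kept:", selected)
```

On standardized features, the magnitude of each surviving coefficient also serves as a rough importance ranking.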

End of the Document